Exploration of Different Imputation Methods for Missing Data

Author

Karthik Aerra, Liz Miller, Mohit Kumar Veeraboina, Robert Stairs

Published

July 27, 2024

Slides

Introduction

Missing data is a key challenge that data professionals encounter in their daily work. In the real world, data is often messy, incomplete, and requires a level of pre-processing before data analysis can occur. Part of data pre-processing is dealing with missing data. Missing data can occur for several reasons including but not limited to processing error, machine or measurement errors, non-responses in surveys, invalid calculations (e.g. division by zero), and study participant dropout (Emmanuel 2021). Missing data can be problematic for analysis because it can reduce the number of usable observations, introduce bias, and interfere with different analysis tools such as machine learning algorithms, which cannot automatically handle missing data.

One of the first steps in handling missing data is to determine how much of the data is missing and from which columns. Missing data in a dataset can be represented in several ways including NA, NaN, Null, None, blanks, spaces, empty strings, or placeholder values such as 999999 (or equivalent obvious numerical outlier). For standard missing values such as NAs, functions such as is.na() in R can be used to count the number of missing observations. For string-related missing values such as space (“ ” or empty strings (“”), functions such as unique() in R can be used to identify unexpected strings. For numerical placeholders such as 0, 99999 (or equivalent), or negative values, histograms of the data can be useful to identify missing values.

If a dataset contains missing data, there are two basic approaches: deletion or imputation (Schefer 2002). Deletion of data can be problematic for analysis, especially when the proportion of missing data is high or the overall number of observations is low. Deletion of data can also lead to bias or underestimation of variance if the data are not missing completely at random. Imputation is the process of replacing missing values with estimates, which can be accomplished using statistical or machine learning methods. It is crucial when choosing an approach to missing data that the underlying mechanism(s) for the missing data are understood.

Missing Data Mechanisms

The mechanisms for missing data can generally be broken up into three categories (Mack, Su, and Westreich 2018), (May et al. 2009), (Bennett 2001), (Little et al. 2014), (Du, Hu, and Zhang 2020):

Missing Completely at Random (MCAR):

When data are MCAR, the probability of a record missing is independent from observed and unobserved data. In other words, the likelihood of an observation having missing data is the same across all observations (Buuren 2018). The pattern for missing data is unrelated to the data itself. For example, if a continuous monitoring system periodically loses network connectivity or power, and that loss cannot be attributed to other factors (e.g. random), this missingess pattern can be considered MCAR. In the case of MCAR data, the deletion of data reduces the number of observations in the dataset but does not introduce bias. MCAR is the most ideal, but least realistic scenario. The probability function for MCAR data can be described as follows:

Probability function for MCAR data, (Buuren 2018)

Where Yobs is the observed data, Ymis is the missing data, and ψ contains the missing data model parameters. The above statement indicates that the missing data are not dependent on observed data or missing data.

Missing at Random (MAR):

When data are MAR, the probability of a record missing depends on the observed data, but not the unobserved data (Buuren 2018). If performing a survey with age as a variable, women may be more likely to omit their age in the survey. The probability of an observation having a missing data point for age can be therefore be attributed to another observed variable. In the case of MAR, deletion of entire observations with missing data may or may not introduce bias. MAR is typically the underlying assumption for mosy imputation methods. The probability function for MAR data can be described as follows:

Probability function for MAR data, (Buuren 2018)

Where Yobs is the observed data, Ymis is the missing data, and ψ contains the missing data model parameters. The above statement indicates that the missing data are dependent on observed data but not missing data.

Missing Not at Random (MNAR):

When data are MNAR, the probability of a record missing is dependent on the unobserved data. The probability of a missing datapoint is variable, but cannot be accounted for with other known, observed variables. There may be a factor that can predict the probability of a missing value, but it is unknown. MNAR data are the most complicated to analyze. Similarly to MAR, deletion of observations with missing data may or may not introduce bias to the analysis. The probability function for MNAR data can be described as follows:

Probability function for MNAR data, (Buuren 2018)

Where Yobs is the observed data, Ymis is the missing data, and ψ contains the missing data model parameters. The above statement indicates that the missing data could be dependent on observed data or missing data.

Figure 1: MCAR, MAR, and MNAR Example, where M is the probability of a missing data point, X is the observed data, and Y is the unobserved data. (Du, Hu, and Zhang 2020)

Understanding the mechanism behind missing data (MCAR, MAR, or NMAR) is important because many imputation methods require the assumption of MCAR or MAR to be held true. Statistical tests such as Little’s MCAR test, Hawkin’s test and non-parametric test can be used to support whether data or MCAR or not. These tests will be discussed further in the methods section. Visualization of the missing data is also helpful for determining the relationships (or lack thereof) within the missing data. For example, correlation matrix plots, scatterplots, and other custom visualizations can be useful tools in identifying patterns in the missing data (Ghazali, Shaadan, and Idrus 2020).

Techniques for Handling Missing Data

Methods for handling missing data can be split into three broad categories (Ren et al. 2023), (Emmanuel 2021):

Deletion methods:

One of the approaches to handling missing data is to delete instances of missing values. This can be accomplished through either listwise deletion or feature selection, for example. In listwise deletion, all rows with missing values are deleted from the dataset. This results in a dataset with a lower number of observations. In feature selection, columns with high proportion of missing data are deleted from the dataset. In this way, the number of observations in the dataset is maintained, but the total number of features is reduced. Deletion methods are advantageous in that they are quick and simple. With a large dataset and relatively low amounts of missing data, reducing the number of observations or features may be okay. However, deletion can introduce bias into the analysis if the data are not missing completely at random, which is often the case (Ren et al. 2023), (Emmanuel 2021).

Single imputation:

In single imputation, missing data values are replaced with a single value, without consideration of the uncertainty around the prediction for the missing value. In other words, a single value is predicted for each missing data point. This can be accomplished using techniques such as mean imputation, median imputation, hot and cold deck imputation and regression-based techniques. There is not a model defined for predicting the values in the column as a whole, so there is information lost around the variance of the prediction. Single imputation methods are simpler than multiple imputation methods, but they tend to underestimate the variance resulting models may have artifically low confidence intervals (Ren et al. 2023), (Emmanuel 2021).

Multiple imputation:

In multiple imputation methods such as MICE, a model is constructed to predict missing values in each column based on observed data in other columns. This model provides a point estimate for the missing values, as well as the uncertainty around the predictions. Multiple iterations are performed to predict the missing values. In each iteration, a missing value is predicted based on the model and its uncertainty. That is, with multiple iterations, you will end up with a distribution of possible missing values. In this way, multiple datasets with multiple possible values for the missing data are generated. Once the iterations are complete, the missing data predictions can consolidated back into a single dataset, where the missing values will be a result of the point estimate and the variance or uncertainty from the model. Multiple imputation can be a powerful technique that is helpful in preserving some of the variance or uncertainty around missing values, but it can be complex and computationally expensive. (Ren et al. 2023), (Emmanuel 2021).

Imputation methods can be further broken down into statistical versus machine learning methods. According to (Lin and Tsai 2020)’s review of literature from 2006-2017, four of the most common statistical methods for missing value imputation include expectation management (EM), linear regression, least squares (LS), and mean/mode imputation. For machine learning, the four most common methods include clustering, decision trees, random forest, and KNN.

Theories for Missing Data

Rubin’s Missing Data Theory:

This foundational theory categorizes missing data into MCAR, MAR, and NMAR. It provides a framework for understanding the mechanisms behind missing data and guides the selection of appropriate imputation strategies based on these mechanisms(Rubin 2004).

Statistical Theory of Imputation:

This theory involves using statistical models to estimate missing values, leveraging the relationship between observed and missing data points. Multiple imputation, a prominent technique within this theory, generates multiple plausible values for each missing data point to account for uncertainty (Schafer 1997).

Machine Learning Theories:

Machine learning techniques are increasingly employed for imputing missing data due to their ability to learn patterns from data and make predictions. Methods such as k-nearest neighbors (KNN) and neural networks are applied to predict missing values based on observed data patterns (Batista and Monard 2003).

These theories provide a comprehensive framework for understanding and implementing imputation methods across various datasets and analytical contexts.

Research Methods for Missing Data:

Descriptive Studies:

These studies analyze the patterns of missing data and evaluate the effectiveness of various imputation methods. Examples include the work by Jerez et al. and Güzel (Jerez 2010), (Guzel 2013).

Comparative Studies:

These studies compare the performance of different imputation methods. Joseph G. Ibrahim and Siming Zheng have conducted such comparative analyses, highlighting the strengths and weaknesses of various approaches (Ibrahim 2011), (Zhang 2024), (Zheng, Zhang, and Zhou 2023).

Simulation Studies:

These studies test imputation methods on controlled datasets, allowing researchers to systematically observe the impact of different techniques. This helps in understanding the conditions under which each method performs best.

Application-Based Studies:

These studies apply imputation methods to real-world data, demonstrating their practical use and the challenges involved. Examples include the works by Saqib Ejaz Awan and Emmanuel, which show how imputation methods are implemented in practical scenarios (Awan 2022), (Emmanuel 2021).

Theoretical Analysis:

Critical decisions based on statistical theories and useful techniques are involved in addressing missing data in datasets. By categorizing missing data into MCAR, MAR, and MNAR, Rubin’s Missing Data Theory offers a fundamental framework for choosing the best imputation techniques. Statistical Theory of Imputation estimates missing values by using models such as multiple imputation, which preserves uncertainty by making many credible guesses. In contrast, machine learning theories use data-driven insights to forecast missing values based on observed patterns using algorithms like KNN and neural networks. Many ways are available with trade-offs in simplicity, accuracy, and processing needs. Examples of these techniques are deletion methods, single imputation (e.g., mean imputation), and multiple imputation (e.g., MICE). These techniques are validated by comparative and simulation studies, which assess their efficacy in various circumstances and datasets. Comprehending these theories and approaches is essential for minimizing biases and optimizing data utility in analytical tasks, highlighting the multidisciplinary aspect of tackling difficulties related to missing data in practical applications.

Project Purpose

The purpose of this project is to demonstrate the use of several different types of missing data imputation methods using a test dataset with missing data. The performance of various imputation methods will be assessed using RMSE and R-Squared of the resulting predictive models after applying various imputation methods.

Methods

An overview of the methodology used to demonstrate different techniques for imputing missing data is shown below:

Figure 2: Overview of Methodology for Demonstrating Different Techniques for Missing Data Imputation. Figured generated by Robert Stairs, 14JUL2024.

Data were classified as either MCAR or not MCAR using Hawkin’s Test, Non-Parametric Test, and Little’s MCAR Test (Li 2013). The Hawkin’s Test checks for both multivariate normality and homoscedasticity (equal variances), and assumes normality. The Non-Parametric test checks for homoscedasticity without assuming normality. Lastly, the Little’s MCAR test is a chi-squared test that evaluates whether mean differences exists between groups with different missing data patterns. In these tests, the hypotheses are as follows:

  • Null Hypothesis (H0): The data is missing completely at random (MCAR).
  • Alternative Hypothesis (H1): The data is not missing completely at random.

Note, this procedure is only designed to test between MCAR/not MCAR and is not able to classify data as MAR or MNAR. Data visualization is recommended to provide supportive information to support the MAR assumptions.

Missing Value Deletion Methods

The following methods were used to delete missing data for this project:

List-wise Deletion: all rows containing missing values are removed from the dataset

Column Delete with Threshold (Feature Selection): if the total percentage of missing values in a column is over a threshold (20%), then the entire column/feature is removed.

Imputation Methods

The following methods were used to impute missing data for this project:

Mean imputation: missing values are replaced with the mean of the non-missing values of the corresponding feature.

Mode imputation: missing values are replaced with the mode of the non-missing values of the corresponding feature.

Median imputation: missing values are replaced with the median of the non-missing values of the corresponding feature.

MICE (multiple imputation by chained equations): missing data for a given column are imputed based on other columns with observed data over multiple iterations (Azur et al. 2011). It operates under the assumption that the data are missing at random (MAR). For a missing column, the missing values are imputed using regression based on observed data from other columns in the dataset. Then, the next column’s missing data are imputed using observed data in other columns and so forth until regression models are available for each column. Each regression model will have its own level of uncertainty. With each iteration, a possible missing value is imputed for a given data point given the regression function and its associated uncertainty. This process is repeated over multiple iterations, where the number of iterations is set by the user. This results in multiple datasets, which are then combined into a single dataset again. In this case, 50 iterations were used.

Figure 3: Multiple Imputation versus Single Imputation Methods, (Ren et al. 2023)

missForest: uses a random forest trained on observed data to predict missing values in continuous and categorical data. It is an ensemble machine learning method, where the data set is bootstrap sampled randomly and individual decision trees are constructed for each bootstrap (Tang 2017). The collection of decision trees is then used to impute a missing data point. This technique in machine learning helps prevent overfitting that decision trees are prone to.

Machine Learning Methods for Model-Fitting

Once missing values were imputed using each of the five methods outlined above, each resulting dataset (5) was used to fit models for Ozone levels. The datasets were split into training and test sets using a split ratio of 0.75. The following models were then used for training each dataset:

Decision Tree:

Decision trees are a tree-structured classifier used for both classification and regression. It splits the data into subsets based on the value of input features, forming a tree where each node represents a feature and each branch represents a decision rule. Advantages include: Easy to interpret and visualize, can handle mixed-type data, non-parametric making it flexible for various data distributions, requires little data pre-processing. Disadvantages include: Prone to overfitting, sensitive to small variations in the data, can create biased trees if some classes/features are dominant.

Figure 4: Visual representation of decision tree algorithm (https://365datascience.com/tutorials/machine-learning-tutorials/decision-trees/)

Random Forest:

Random forest is an ensemble learning method primarily used for classification and regression. It combines the predictions of multiple decision trees to improve accuracy and control over-fitting. It construct the decision trees during training and outputting the mode of the classes (classification) or mean prediction (regression) of the individual trees. Advantages include: high accuracy and robustness against over-fitting, works well with large datasets and high-dimensional spaces, handles both classification and regression tasks. Disadvantages include: can be computationally intensive and slower to predict, less interpretable compared to a single decision tree.

Figure 5: Visual representation of random forest algorithm (https://medium.com/(denizgunay/random-forest-af5bde5d7e1e?))

K-Nearest Neighbors (KNN):

KNN is a simple, instance-based learning algorithm used for classification and regression. It classifies a data point based based on the stored instances from the training data without learning explicit parameters, but instead on how its neighbors are classified. Advantages include: simple and easy to implement, algorithm doesn’t have a separate training phase making it more flexible. Disadvantages include: Computationally expensive for large datasets due to the need to compute distances to all points, sensitive to irrelevant or redundant features, performance depends on the choice of the number of k and the distance metric used.

Figure 6: Visual representation of KNN algorithm, where k is the number of nearest neighbors (https://towardsdatascience.com/why-does-increasing-k-decrease-variance-in-knn-9ed6de2f5061)

The effectiveness of each imputation technique was evaluated using RMSE and R-Squared for the resulting Ozone level models. The RMSE and R-Squared for each imputation/model combination were reported and summarized in a data frame. The combination with the lowest RMSE was determined to be the most effective imputation technique and model fit for this dataset.

Analysis and Results

Description of the Dataset

The Ozone dataset (“Ozone: Los Angeles Ozone Pollution Data, 1976”) from the mlbench package in R was chosen for this project. It contains observations related to pollution levels in the Los Angeles area during 1976. The variables contained in the dataset are thought to influence ozone concentration. These include daily ozone measurements consisting of:

  • V1: Month, 1-12, where 1 is January and 12 is December
  • V2: Day of month
  • V3: Day of week, 1-7, where 1 is Monday and 7 is Sunday
  • V4: Daily maximum one-hour-average ozone reading
  • V5: 500 millibar pressure height (m) measured at Vandenberg AFB
  • V6: Wind speed (mph) at Los Angeles International Airport (LAX)
  • V7: Humidity (%) at LAX
  • V8: Temperature (degrees F) measured at Sandburg, CA
  • V9: Temperature (degrees F) measured at El Monte, CA
  • V10: Inversion base height (feet) at LAX
  • V11: Pressure gradient (mm Hg) from LAX to Daggett, CA
  • V12: Inversion base temperature (degrees F) at LAX
  • V13: Visibility (miles) measured at LAX

The dataset allows for analysis of how different meteorological conditions/factors can affect ozone levels. This dataset was chosen because it already has a high frequency of missing data, which is representative of a real-world application for missing data. It contains a total of 13 variables, 366 observations (one for each day for one year), and 139 missing or NA values for V9, which represents 38% of the total observations in the dataset. In practical applications, the missing values are unknown (not simulated), so it is up to the investigator to choose a methodology for handling the missing data. The investigator will also need to choose appropriate metrics for evaluating the effectiveness of different methods for handling missing data.

Description of the Imputation Workflow(RS,EM)

The data were first pre-processed. This step typically entails importing the data, checking for missing values, checking data types, and making transformations as needed. In this case, all columns were converted to numeric data type and renamed for easier interpretation. Missing values by column were reported in summary statistics, but were not addressed until further visualization was performed.

Following pre-processing, correlation coefficients were calculated for all variables with respect to Ozone levels (the response variable). Variables with strong positive or negative correlation were then visualized using histograms to show the distribution of each variable with respect to percentage of Ozone level readings. Missing data were then visualized to determine if there were any significant patterns to the missingness of the data. The purpose of this step is to provide evidence to support that the data are either MCAR, MAR, or MNAR.

Missing data were then imputed with various methodologies, including list-wise deletion, feature removal, mean imputation, mode imputation, median imputation, MICE, and missForest. This resulted in 7 unique datasets, where missing data were imputed using different metholodologies. To compare the effectiveness of each imputation method, each of the imputed datasets were split into test and training datasets. The training set was used to develop machine learning models to predict Ozone levels, including decision trees, random forest, and KNN. The resulting RMSE from each machine learning model and imputation method were used to identify the most effective imputation method and model combination for this dataset.

Load Packages

Code
knitr::opts_chunk$set(echo = TRUE)
#install.packages('mlbench')

library(mlbench)
library(dplyr)
library(ggplot2)
library(vtable)
library(knitr)
library(kableExtra)
library(tidyverse)
library(tidyr)
library(naniar)
library(ggplot2)
library(UpSetR)
library(corrplot)
library(gridExtra)
library(reshape2)
library(ggpubr)
library(caTools)
library(party)
library(magrittr)
library(ggfittext)

Importing and Summarizing the Data

First, the data were imported and converted to a data frame. Next, columns were reordered such that V4 (response variable) is the first column. Lastly, the data were summarized using the summary() function. The summary function gives the number of NA’s per column. There are 139 missing values for V9 in the dataset. The number of missing values (NA’s) in each column are as follows:

  • V4: 5
  • V1: None
  • V2: None
  • V3: None
  • V5: 12
  • V6: None
  • V7: 15
  • V8: 2
  • V9: 139
  • V10: 15
  • V11: 1
  • V12: 14
  • V13: None

Import Data

Code
# import data
data("Ozone", package = "mlbench")

# convert to data frame
ozone1 <- data.frame(Ozone)
# Reorder columns
ozone1 <- ozone1 %>%
  dplyr::select(V4,V1,V2,V3,V5,V6,V7,V8,V9,V10,V11,V12,V13)

# View summary of data
summary(ozone1)
       V4              V1            V2      V3           V5      
 Min.   : 1.00   1      : 31   1      : 12   1:52   Min.   :5320  
 1st Qu.: 5.00   3      : 31   2      : 12   2:52   1st Qu.:5700  
 Median : 9.00   5      : 31   3      : 12   3:52   Median :5770  
 Mean   :11.53   7      : 31   4      : 12   4:53   Mean   :5753  
 3rd Qu.:16.00   8      : 31   5      : 12   5:53   3rd Qu.:5830  
 Max.   :38.00   10     : 31   6      : 12   6:52   Max.   :5950  
 NA's   :5       (Other):180   (Other):294   7:52   NA's   :12    
       V6               V7              V8              V9       
 Min.   : 0.000   Min.   :19.00   Min.   :25.00   Min.   :27.68  
 1st Qu.: 3.000   1st Qu.:49.00   1st Qu.:51.00   1st Qu.:49.73  
 Median : 5.000   Median :65.00   Median :62.00   Median :57.02  
 Mean   : 4.869   Mean   :58.48   Mean   :61.91   Mean   :56.85  
 3rd Qu.: 6.000   3rd Qu.:73.00   3rd Qu.:72.00   3rd Qu.:66.11  
 Max.   :11.000   Max.   :93.00   Max.   :93.00   Max.   :82.58  
                  NA's   :15      NA's   :2       NA's   :139    
      V10            V11             V12             V13       
 Min.   : 111   Min.   :-69.0   Min.   :27.50   Min.   :  0.0  
 1st Qu.: 890   1st Qu.:-10.0   1st Qu.:51.26   1st Qu.: 70.0  
 Median :2125   Median : 24.0   Median :62.24   Median :110.0  
 Mean   :2591   Mean   : 17.8   Mean   :60.93   Mean   :123.3  
 3rd Qu.:5000   3rd Qu.: 45.0   3rd Qu.:70.52   3rd Qu.:150.0  
 Max.   :5000   Max.   :107.0   Max.   :91.76   Max.   :500.0  
 NA's   :15     NA's   :1       NA's   :14                     

Data Pre-Processing

The data type was evaluated for each column. V1, V2, and V3 (Month, day of month, and day of week) were found to be “factor” data type, while the remaining columns were found to be be “numeric”. V1, V2, and V3 columns were converted to numeric data type.

Check data types

Code
# Create function to check data types of data frame columns
check_data_types <- function(ozone1) {
  sapply(ozone1, class)
}

# Check data types of the columns
data_types <- check_data_types(ozone1)
print(data_types)
       V4        V1        V2        V3        V5        V6        V7        V8 
"numeric"  "factor"  "factor"  "factor" "numeric" "numeric" "numeric" "numeric" 
       V9       V10       V11       V12       V13 
"numeric" "numeric" "numeric" "numeric" "numeric" 
Code
# Function to convert specified columns from factor to numeric
convert_factors_to_numeric <- function(data, columns) {
  data[columns] <- lapply(data[columns], function(x) as.numeric(as.character(x)))
  return(data)
}

# Convert first 3 columns from factor to numeric
columns_to_convert <- c("V1", "V2", "V3")
ozone_date <- convert_factors_to_numeric(ozone1, columns_to_convert)
ozone2 <- convert_factors_to_numeric(ozone1, columns_to_convert)

# Check data types of the columns again
data_types <- sapply(ozone_date, class)
print(data_types)
       V4        V1        V2        V3        V5        V6        V7        V8 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       V9       V10       V11       V12       V13 
"numeric" "numeric" "numeric" "numeric" "numeric" 
Code
data_types2 <- sapply(ozone2, class)
print(data_types2)
       V4        V1        V2        V3        V5        V6        V7        V8 
"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
       V9       V10       V11       V12       V13 
"numeric" "numeric" "numeric" "numeric" "numeric" 

Created a new column “Date” that combines date columns (month, day of month, day of week) into a single column.

Code
# combine Day and Month to create a Date column
ozone_date <- ozone_date %>%
  mutate(Date = as.Date(paste(1976, V1, V2, sep = "-"), format = "%Y-%m-%d"))

# Verify the new Date column
head(ozone_date, n=10)
   V4 V1 V2 V3   V5 V6 V7 V8    V9  V10 V11   V12 V13       Date
1   3  1  1  4 5480  8 20 NA    NA 5000 -15 30.56 200 1976-01-01
2   3  1  2  5 5660  6 NA 38    NA   NA -14    NA 300 1976-01-02
3   3  1  3  6 5710  4 28 40    NA 2693 -25 47.66 250 1976-01-03
4   5  1  4  7 5700  3 37 45    NA  590 -24 55.04 100 1976-01-04
5   5  1  5  1 5760  3 51 54 45.32 1450  25 57.02  60 1976-01-05
6   6  1  6  2 5720  4 69 35 49.64 1568  15 53.78  60 1976-01-06
7   4  1  7  3 5790  6 19 45 46.40 2631 -33 54.14 100 1976-01-07
8   4  1  8  4 5790  3 25 55 52.70  554 -28 64.76 250 1976-01-08
9   6  1  9  5 5700  3 73 41 48.02 2083  23 52.52 120 1976-01-09
10  7  1 10  6 5700  3 59 44    NA 2654  -2 48.38 120 1976-01-10

Rename columns for clearer interpretation during analysis

Code
# rename columns
ozone2 <- plyr::rename(ozone1, c('V4'="Ozone_reading",
                                 'V1'="Month", 
                                 'V2'="Day_of_month",
                                 'V3'="Day_of_week", 
                                 'V5'="Pressure_afb", 
                                 'V6'="Wind_speed_LAX", 
                                 'V7'="Humidity_LAX", 
                                 'V8'="Temp_sandburg", 
                                 'V9'="Temp_EM", 
                                 'V10'="IBH_LAX", 
                                 'V11'="Pressure_gradient", 
                                 'V12'="IBT_LAX", 
                                 'V13'="Visibility_LAX"))

ozone_date2 <- plyr::rename(ozone_date, c('V4'="Ozone_reading",
                                 'V1'="Month", 
                                 'V2'="Day_of_month",
                                 'V3'="Day_of_week", 
                                 'V5'="Pressure_afb", 
                                 'V6'="Wind_speed_LAX", 
                                 'V7'="Humidity_LAX", 
                                 'V8'="Temp_sandburg", 
                                 'V9'="Temp_EM", 
                                 'V10'="IBH_LAX", 
                                 'V11'="Pressure_gradient", 
                                 'V12'="IBT_LAX", 
                                 'V13'="Visibility_LAX"))
head(ozone2, n=10)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1              3     1            1           4         5480              8
2              3     1            2           5         5660              6
3              3     1            3           6         5710              4
4              5     1            4           7         5700              3
5              5     1            5           1         5760              3
6              6     1            6           2         5720              4
7              4     1            7           3         5790              6
8              4     1            8           4         5790              3
9              6     1            9           5         5700              3
10             7     1           10           6         5700              3
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
1            20            NA      NA    5000               -15   30.56
2            NA            38      NA      NA               -14      NA
3            28            40      NA    2693               -25   47.66
4            37            45      NA     590               -24   55.04
5            51            54   45.32    1450                25   57.02
6            69            35   49.64    1568                15   53.78
7            19            45   46.40    2631               -33   54.14
8            25            55   52.70     554               -28   64.76
9            73            41   48.02    2083                23   52.52
10           59            44      NA    2654                -2   48.38
   Visibility_LAX
1             200
2             300
3             250
4             100
5              60
6              60
7             100
8             250
9             120
10            120
Code
head(ozone_date2, n=10)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1              3     1            1           4         5480              8
2              3     1            2           5         5660              6
3              3     1            3           6         5710              4
4              5     1            4           7         5700              3
5              5     1            5           1         5760              3
6              6     1            6           2         5720              4
7              4     1            7           3         5790              6
8              4     1            8           4         5790              3
9              6     1            9           5         5700              3
10             7     1           10           6         5700              3
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
1            20            NA      NA    5000               -15   30.56
2            NA            38      NA      NA               -14      NA
3            28            40      NA    2693               -25   47.66
4            37            45      NA     590               -24   55.04
5            51            54   45.32    1450                25   57.02
6            69            35   49.64    1568                15   53.78
7            19            45   46.40    2631               -33   54.14
8            25            55   52.70     554               -28   64.76
9            73            41   48.02    2083                23   52.52
10           59            44      NA    2654                -2   48.38
   Visibility_LAX       Date
1             200 1976-01-01
2             300 1976-01-02
3             250 1976-01-03
4             100 1976-01-04
5              60 1976-01-05
6              60 1976-01-06
7             100 1976-01-07
8             250 1976-01-08
9             120 1976-01-09
10            120 1976-01-10

Features Legend: * Ozone_reading: Daily maximum one-hour-average ozone reading * Month: 1 = January, …, 12 = December * Day of month: 1-30/31 * Day of week: 1 = Monday, …, 7 = Sunday * Pressure_afb: 500 millibar pressure height (m) measured at Vandenberg AFB * Wind_speed_LAX: Wind speed (mph) at Los Angeles International Airport (LAX) * Humidity_LAX: Humidity (%) at LAX * Temp_sandburg: Temperature (degrees F) measured at Sandburg, CA * Temp_EM: Temperature (degrees F) measured at El Monte, CA * IBH_LAX: Inversion base height (feet) at LAX * Pressure_gradient: Pressure gradient (mm Hg) from LAX to Daggett, CA * IBT_LAX: Inversion base temperature (degrees F) at LAX * Visibility_LAX: Visibility (miles) measured at LAX

Data Exploration

Summary Statistics by Day of Week No notable trend in ozone levels by day of the week, except that mean and median ozone levels appear to be lower on day 4 of the week (Thursdays).

Code
# Summary statistics by day of the week 
ozone_summary_by_day <- ozone2 %>%
  group_by(Day_of_week) %>%
  summarize(
    mean_ozone = mean(Ozone_reading, na.rm = TRUE),
    median_ozone = median(Ozone_reading, na.rm = TRUE),
    max_ozone = max(Ozone_reading, na.rm = TRUE),
    min_ozone = min(Ozone_reading, na.rm = TRUE)
  )

print(ozone_summary_by_day)
# A tibble: 7 × 5
  Day_of_week mean_ozone median_ozone max_ozone min_ozone
  <fct>            <dbl>        <dbl>     <dbl>     <dbl>
1 1                11.6           9          38         2
2 2                12.6           9.5        34         2
3 3                12.6          12          33         2
4 4                 9.58          7          26         1
5 5                11.2           9          32         2
6 6                12.1          10          33         2
7 7                10.9          10          27         1

Summary Statistics by Month Mean and median ozone levels are lowest in the months of January, February, and March on average. They increase during the summer months, peaking in July, and gradually decline through the remainder of the year.

Code
ozone_summary_by_month <- ozone2 %>%
  group_by(Month) %>%
  summarize(
    mean_ozone = mean(Ozone_reading, na.rm = TRUE),
    median_ozone = median(Ozone_reading, na.rm = TRUE),
    max_ozone = max(Ozone_reading, na.rm = TRUE),
    min_ozone = min(Ozone_reading, na.rm = TRUE)
  )

print(ozone_summary_by_month)
# A tibble: 12 × 5
   Month mean_ozone median_ozone max_ozone min_ozone
   <fct>      <dbl>        <dbl>     <dbl>     <dbl>
 1 1           5.45          5          11         3
 2 2           7.14          6          23         2
 3 3           8.84          9          24         2
 4 4          10             9          22         3
 5 5          15.8          16          33         2
 6 6          17.1          17          30         2
 7 7          20.2          19          34        10
 8 8          17.5          16          38         2
 9 9          12.4          11.5        29         2
10 10         12.6          12          30         3
11 11          7.67          5.5        20         1
12 12          4.06          4           8         1

Examine how feature variables correlate with Ozone levels

Correlation coefficients were determined for each variable with respect to Ozone levels, the response variable. A bar graph was used to summarize the correlation coefficients. The variables with strong correlations were:

  • Strong Positive Correlations: Humidity_LAX, Pressure_afb, IBT_LAX, Temp_EM, and Temp_sandburg
  • Strong Negative Correlations: IBH_LAX and Visibility_LAX
Code
# Make sure data is in numeric form
ozone2[] <- lapply(ozone2, as.numeric)

# Calculate correlations with Ozone
corr_coeffs <- cor(ozone2, use = "complete.obs")['Ozone_reading', ]
corr_coeffs <- corr_coeffs[!names(corr_coeffs) %in% 'Ozone_reading']

# Create a data frame for plotting
corr_df <- data.frame(Variable = names(corr_coeffs), Correlation = corr_coeffs)

# Create the bar graph
ggplot(corr_df, aes(x = reorder(Variable, Correlation), y = Correlation)) +
  geom_bar(stat = 'identity',fill = "blue") +
  xlab('Variable') +
  ylab('Correlation Coefficient') +
  ggtitle('Correlation Between Variables and Daily Average Ozone Reading') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Explore Interactions with Positively Correlated Variables

Strong Positive Correlations: Humidity_LAX, Pressure_afb, IBT_LAX, Temp_EM, and Temp_sandburg

The percentage of ozone readings appears to be approximately normally distributed with respect to Pressure_afb, IBT_LAX, and Temp_sandburg. There appears to be some negative skew with respect to humidity, with a secondary mode at the lowest humidity bin (10-20). There are a lot of missing data points for Temp_EM (139 observations out of 366), as discussed in the initial data summary. This represents 38.0% of the observations for Temp_EM.

Code
## Humidity
# Create bins for humidity levels
ozone3 <- ozone2 %>%
  mutate(humidity_bin = cut(Humidity_LAX, breaks = seq(10, 100, by = 10), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data1 <- ozone3 %>%
  group_by(humidity_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data1, aes(x = humidity_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Humidity Levels",
       x = "Humidity Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Code
## Pressure
# Create bins for humidity levels
ozone3 <- ozone2 %>%
  mutate(pressure_bin = cut(Pressure_afb, breaks = seq(5300, 6000, by = 100), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data2 <- ozone3 %>%
  group_by(pressure_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data2, aes(x = pressure_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Pressure Levels",
       x = "Pressure Levels",
       y = "Percentage of Ozone Readings") +
  theme(axis.text.x = element_text(angle = 45, vjust=0.5, size=8))

Code
## IBT - Inversion base temperature (degrees F) at LAX
# Create bins for Inversion base temp levels
ozone3 <- ozone2 %>%
  mutate(IBT_bin = cut(IBT_LAX, breaks = seq(20, 100, by = 10), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data3 <- ozone3 %>%
  group_by(IBT_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data3, aes(x = IBT_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Inversion base temp Levels",
       x = "Inversion base temp Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Code
## Temp_EM - Temperature (degrees F) measured at El Monte, CA
# Create bins for temp levels
ozone3 <- ozone2 %>%
  mutate(Temp_EM_bin = cut(Temp_EM, breaks = seq(20, 100, by = 10), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data4 <- ozone3 %>%
  group_by(Temp_EM_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data4, aes(x = Temp_EM_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Temp at El Monte Levels",
       x = "Temp Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Code
## Temp_sandburg
# Create bins for temp levels
ozone3 <- ozone2 %>%
  mutate(Temp_sd_bin = cut(Temp_sandburg, breaks = seq(20, 100, by = 10), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data5 <- ozone3 %>%
  group_by(Temp_sd_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data5, aes(x = Temp_sd_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Temp at Sandburg Levels",
       x = "Temp Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Explore Interactions with Negatively Correlated Variables

  • Strong Negative Correlations: IBH_LAX and Visibility_LAX The percentage of ozone readings does not appear to be normally distributed with respect to IBH_LAX and Visbility_LAX. The highest proportion of Ozone readings are at high IBH (4600-5100), with a secondary mode at low IBH (600-1100). The proportion of Ozone readings are positively skewed with respect to Visbility_LAX, with the mode occurring between 50 and 100, and a secondary mode occuring between 250 and 300.
Code
## IBH_LAX - Inversion base height (feet) at LAX
# Create bins for IBH levels
ozone3 <- ozone2 %>%
  mutate(IBH_bin = cut(IBH_LAX, breaks = seq(100, 5500, by = 500), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data6 <- ozone3 %>%
  group_by(IBH_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data6, aes(x = IBH_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by IBH Levels",
       x = "IBH Levels",
       y = "Percentage of Ozone Readings") +
  theme(axis.text.x = element_text(angle = 45, vjust=0.5))

Code
## Visibility
# Create bins for Visibility levels
ozone3 <- ozone2 %>%
  mutate(visibility_bin = cut(Visibility_LAX, breaks = seq(0, 500, by = 50), include.lowest = TRUE))

# Calculate percentage of ozone readings for each bin
percentage_data7 <- ozone3 %>%
  group_by(visibility_bin) %>%
  summarize(Ozone_reading = n()) %>%
  mutate(percentage = (Ozone_reading / sum(Ozone_reading)) * 100)

# Create the bar graph
ggplot(percentage_data7, aes(x = visibility_bin, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Ozone Readings by Visibility Levels",
       x = "Visibility Levels",
       y = "Percentage of Ozone Readings") +
  theme_minimal()

Exploration of the Missing Data

The total number of missing data points for each column is shown below as a count and as a percentage. Most of the columns contain <5% missing values. Temp_EM contains 139 missing values, which is 38.0% of the observations. The total number of missing values in the dataset is 203 out of 4,768 (13 columns times 366 obervations). This represents 4.3% of the entire dataset.

Summarize missing values

Code
# Function to summarize missing values in a data frame
summarize_missing_values <- function(data) {
  data %>%
    summarize_all(~ sum(is.na(.))) %>%
    gather(key = "column", value = "missing_values") %>%
    mutate(missing_percentage = (missing_values / nrow(data)) * 100)
}

missing_summary <- summarize_missing_values(ozone2)

# Print the summary of missing values
print(missing_summary)
              column missing_values missing_percentage
1      Ozone_reading              5          1.3661202
2              Month              0          0.0000000
3       Day_of_month              0          0.0000000
4        Day_of_week              0          0.0000000
5       Pressure_afb             12          3.2786885
6     Wind_speed_LAX              0          0.0000000
7       Humidity_LAX             15          4.0983607
8      Temp_sandburg              2          0.5464481
9            Temp_EM            139         37.9781421
10           IBH_LAX             15          4.0983607
11 Pressure_gradient              1          0.2732240
12           IBT_LAX             14          3.8251366
13    Visibility_LAX              0          0.0000000
Code
print(sum(is.na(ozone2)))
[1] 203

Visualization of the Missing Data

Three different plots were generating to visualize the missing data.

The first uses the vis_miss() function and the resulting plot shows the missing values shaded in black for each column and each row. Since the rows are arranged by date, we can see that the most of the data are missing in random clusters with respect to date. Temp_EM, which has the highest proportion of missing data (38%), appears to be missing in pretty regular intervals. This was further explored in the next section.

Code
# Plot missing data pattern
#Shows what percentage of the data are missing from each column
vis_miss(ozone2)

The second plot uses the gg_miss_upset() function. This plot shows the number of missing values not only by each column, but by each possbile combination of missing values (intersections). For example, the third bar in this figure shows that there are 9 rows that are missing values for IBT_LAX, Humidity_LAX, and IBH_LAX. The majority of the rows with missing data appear to be only missing Temp_EM (127). The next highest intersection is Pressure_afb_NA, which contains 10 rows.

Code
#This plot gives a visual of what combinations of NAs are present and how many there are for each
#set nsets to 8 since we have 8 columns with missing data
gg_miss_upset(ozone2, nsets=8)

The third plot was generated using the gg_miss_var() function and shows the number of missing values for each variable. Again, Temp_EM has by far the largest number of missing points (139).

Code
#Another way to visualize number of missing rows per column
gg_miss_var(ozone2) + labs(y = "Number of missing values") + ylim(0, 150)

Randomness Testing of Missing Data The Little’s MCAR test was performed to analyze randomness. * Null Hypothesis (H0): The data is missing completely at random (MCAR). * Alternative Hypothesis (H1): The data is not missing completely at random. Based on the p-value(4.102052e-12) from the test, the null hypothesis is rejected, data is not missing completely at random (MCAR).

Code
# Perform Little's MCAR test
mcar_result <- mcar_test(ozone2)
print(mcar_result)
# A tibble: 1 × 4
  statistic    df  p.value missing.patterns
      <dbl> <dbl>    <dbl>            <int>
1      278.   134 4.10e-12               13

##Additionally, the Hawkin’s and the Non-Parametric tests were performed to analyze missing data randomness.

Test Data for Normality In order to interpret the tests, the data must be checked for normality. The Shapiro-Wilk and the Anderson-Darling tests were performed.

Code
# Extract the dependent variable
ozone_levels <- ozone2$Ozone_reading

# Remove NA values
ozone_levels <- na.omit(ozone_levels)

# Perform normality tests
library(nortest)
shapiro_test <- shapiro.test(ozone_levels)
ad_test <- ad.test(ozone_levels)

# Print the results
print(shapiro_test)

    Shapiro-Wilk normality test

data:  ozone_levels
W = 0.91044, p-value = 8.37e-14
Code
print(ad_test)

    Anderson-Darling normality test

data:  ozone_levels
A = 10.027, p-value < 2.2e-16

A histogram and QQ plot were also created

Code
# Histogram
ggplot(ozone2, aes(x = Ozone_reading)) + geom_histogram(binwidth = 5) + ggtitle("Histogram of Ozone")

Code
# Q-Q plot
qqnorm(ozone2$Ozone_reading)
qqline(ozone2$Ozone_reading)

Run Hawkins and Non-Parametric Tests

Code
library(dplyr)
explanatory = c("Temp_EM","Month", "Day_of_month","Day_of_week", "Pressure_afb", "Wind_speed_LAX", "Humidity_LAX", "Temp_sandburg", "IBH_LAX", 
                "Pressure_gradient", "IBT_LAX", "Visibility_LAX")
dependent = "Ozone_reading"

# Run randomness test
ozone2 %>%
  select(explanatory)%>%
  MissMech::TestMCARNormality()
Call:
MissMech::TestMCARNormality(data = .)

Number of Patterns:  4 

Total number of cases used in the analysis:  354 

 Pattern(s) used:
          Temp_EM   Month   Day_of_month   Day_of_week   Pressure_afb
group.1        NA       1              1             1              1
group.2         1       1              1             1              1
group.3         1       1              1             1              1
group.4         1       1              1             1             NA
          Wind_speed_LAX   Humidity_LAX   Temp_sandburg   IBH_LAX
group.1                1              1               1         1
group.2                1              1               1         1
group.3                1             NA               1        NA
group.4                1              1               1         1
          Pressure_gradient   IBT_LAX   Visibility_LAX   Number of cases
group.1                   1         1                1               129
group.2                   1         1                1               206
group.3                   1        NA                1                 9
group.4                   1         1                1                10

    Test of normality and Homoscedasticity:
  -------------------------------------------

Hawkins Test:

    P-value for the Hawkins test of normality and homoscedasticity:  0.0113103 

    Either the test of multivariate normality or homoscedasticity (or both) is rejected.
    Provided that normality can be assumed, the hypothesis of MCAR is 
    rejected at 0.05 significance level. 

Non-Parametric Test:

    P-value for the non-parametric test of homoscedasticity:  0.2743229 

    Reject Normality at 0.05 significance level.
    There is not sufficient evidence to reject MCAR at 0.05 significance level.

Hawkins Test

  • P-value for the Hawkins test of normality and homoscedasticity: 0.0113103
  • This test checks for both multivariate normality and homoscedasticity (equal variances).
  • Interpretation: Since the p-value is less than 0.05, the null hypothesis of multivariate normality or homoscedasticity (or both) is rejected. This means that the data does not follow a multivariate normal distribution, or the variances are not equal across groups.
  • MCAR Hypothesis: Provided that the above tests show that the data is not normally distributed, the hypothesis of MCAR (Missing Completely At Random) is not rejected at the 0.05 significance level.

Non-Parametric Test

  • P-value for the non-parametric test of homoscedasticity: 0.2743229
  • This test checks for homoscedasticity without assuming normality.
  • Interpretation: Since the p-value is greater than 0.05, there is not enough evidence to reject the null hypothesis of homoscedasticity. This suggests that the variances are equal across groups.
  • MCAR Hypothesis: There is not sufficient evidence to reject the hypothesis of MCAR at the 0.05 significance level.

Further Visualization of the Missing Data: Missingness Patterns

The data were further visualized using the gg_miss_fct() function to plot the proportion of missing values for each column (y-axis) broken down by one variable as a time (x-axis). The purpose of this is to determine if there are discernible patterns in the missingness of the data. This provides additional supportive information for classifying the data as MCAR, MAR, or MNAR.

Focusing on Temp_EM since it is the column with the highest number of missing values, the data appear to be missing randomly with respect to most of the variables. With respect to the date columns, the pattern of missingness appears to be random with respect to day of the month. However, it is apparent that there is a high proportion of missing data on days 6 and 7 of the week (Saturdays and Sundays). This indicates that Temp_EM was likely not measured on the weekends throughout the year. The values for Temp_EM also appear to be missing randomly with respect to month, though there is a notable high frequency of missing data around the month of May.

For the other variables in the dataset, there does not appear to be any obvious patterns to the missing data. Most importantly, the frequency of missing data for Temp_EM does not appear to depend on the output variable (ozone reading). Therefore, we can most likely proceed to imputation with the assumption that our data are missing completely at random (MCAR).

Code
# Create gg_miss_fct plots with adjusted themes
p1 <- gg_miss_fct(ozone2, fct = Month) + 
  ggtitle("Missing Data by Month") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p2 <- gg_miss_fct(ozone2, fct = Day_of_month) + 
  ggtitle("Missing Data by Day of Month") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p3 <- gg_miss_fct(ozone2, fct = Day_of_week) + 
  ggtitle("Missing Data by Day of Week") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p4 <- gg_miss_fct(ozone2, fct = Ozone_reading) + 
  ggtitle("Missing Data by Ozone Reading") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p5 <- gg_miss_fct(ozone2, fct = Pressure_afb) + 
  ggtitle("Missing Data by Solar Radiation") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p6 <- gg_miss_fct(ozone2, fct = Wind_speed_LAX) + 
  ggtitle("Missing Data by Wind Speed") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p7 <- gg_miss_fct(ozone2, fct = Humidity_LAX) + 
  ggtitle("Missing Data by Humidity (LAX)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p8 <- gg_miss_fct(ozone2, fct = Temp_sandburg) + 
  ggtitle("Missing Data by Temperature (Sandburg)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme(plot.title = element_text(size=8))

p9 <- gg_miss_fct(ozone2, fct = Temp_EM) + 
  ggtitle("Missing Data by Temperature (EM)") +
  theme(plot.title = element_text(size=8))

p10 <- gg_miss_fct(ozone2, fct = IBH_LAX) + 
  ggtitle("Missing Data by IBH_LAX") +
  theme(plot.title = element_text(size=8))

p11 <- gg_miss_fct(ozone2, fct = Pressure_gradient) + 
  ggtitle("Missing Data by Pressure Gradient") +
  theme(plot.title = element_text(size=8))

p12 <- gg_miss_fct(ozone2, fct = IBT_LAX) + 
  ggtitle("Missing Data by IBT_LAX") +
  theme(plot.title = element_text(size=8))

p13 <- gg_miss_fct(ozone2, fct = Visibility_LAX) + 
  ggtitle("Missing Data by Visibility_LAX") +
  theme(plot.title = element_text(size=8))

# Arrange the plots into grids with proper spacing
grid1 <- grid.arrange(p1, p2, p3, p4, nrow = 2)

Code
grid2 <- grid.arrange(p5, p6, p7, p8, nrow = 2)

Code
grid3 <- grid.arrange(p9, p10, p11, nrow = 2)

Code
grid4 <- grid.arrange(p12, p13, nrow = 1)

Missing Data Imputation Using Simple Methods

Missing data were deleted using listwise deletion or feature selection as a baseline comparison for imputation methods. Additionally, missing data imputed using mean, median or mode as a demonstration of simple imputation techniques. Listwise deletion is simply deleting all rows that contain a missing value. Feature removal is where a column is deleted if its total percentage of missing values is over a set threshold (we chose 20%). With mean, median, or mode imputation, all NA values are replaced by a single value for each column: the mean, median, or mode of that column. One dataframe was created for each method, resulting in a total of five dataframes:

Feature selection (column deletion) In this case, the Temp_EM column is dropped because more than 20% of the values are missing. Remaining rows with NAs are also dropped so that machine learning models can be applied.

Code
# Function to drop column if quantity of missing values is over the threshold
drop_na_columns <- function(data, threshold) {
  na_counts <- colSums(is.na(data))
  na_proportion <- na_counts / nrow(data)
  data <- data[, na_proportion <= threshold]
  return(data)
}
# Define threshold (e.g., 20% NA allowed)
threshold <- 0.20

# Drop columns based on the NA threshold
dropCol_data <- drop_na_columns(ozone2, threshold) # the column Temp_EM gets dropped
dropCol_data <- data.frame(dropCol_data)
dropCol_data <- na.omit(dropCol_data)

head(dropCol_data)
  Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
3             3     1            3           6         5710              4
4             5     1            4           7         5700              3
5             5     1            5           1         5760              3
6             6     1            6           2         5720              4
7             4     1            7           3         5790              6
8             4     1            8           4         5790              3
  Humidity_LAX Temp_sandburg IBH_LAX Pressure_gradient IBT_LAX Visibility_LAX
3           28            40    2693               -25   47.66            250
4           37            45     590               -24   55.04            100
5           51            54    1450                25   57.02             60
6           69            35    1568                15   53.78             60
7           19            45    2631               -33   54.14            100
8           25            55     554               -28   64.76            250
Code
print(sum(is.na(dropCol_data)))
[1] 0

Listwise deletion (row deletion)

Create a new dataset where all rows with NAs are deleted.

Code
# Drop all missing values. Row deletion.
dropNA_data <- na.omit(ozone2)

head(dropNA_data)
   Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
5              5     1            5           1         5760              3
6              6     1            6           2         5720              4
7              4     1            7           3         5790              6
8              4     1            8           4         5790              3
9              6     1            9           5         5700              3
12             6     1           12           1         5720              3
   Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
5            51            54   45.32    1450                25   57.02
6            69            35   49.64    1568                15   53.78
7            19            45   46.40    2631               -33   54.14
8            25            55   52.70     554               -28   64.76
9            73            41   48.02    2083                23   52.52
12           44            51   54.32     111                 9   63.14
   Visibility_LAX
5              60
6              60
7             100
8             250
9             120
12            150
Code
print(sum(is.na(dropNA_data)))
[1] 0

Mean imputation

Create a new dataset where all missing values are replaced with the mean of the column.

Code
# Function for mean imputation
mean_impute <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  return(x)
}

imputed_data_mean <- apply(ozone2, 2, mean_impute)
imputed_data_mean <- data.frame(imputed_data_mean)

head(imputed_data_mean)
  Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1             3     1            1           4         5480              8
2             3     1            2           5         5660              6
3             3     1            3           6         5710              4
4             5     1            4           7         5700              3
5             5     1            5           1         5760              3
6             6     1            6           2         5720              4
  Humidity_LAX Temp_sandburg  Temp_EM  IBH_LAX Pressure_gradient  IBT_LAX
1     20.00000      61.91484 56.84634 5000.000               -15 30.56000
2     58.47578      38.00000 56.84634 2590.943               -14 60.92733
3     28.00000      40.00000 56.84634 2693.000               -25 47.66000
4     37.00000      45.00000 56.84634  590.000               -24 55.04000
5     51.00000      54.00000 45.32000 1450.000                25 57.02000
6     69.00000      35.00000 49.64000 1568.000                15 53.78000
  Visibility_LAX
1            200
2            300
3            250
4            100
5             60
6             60
Code
print(sum(is.na(imputed_data_mean)))
[1] 0

Median imputation

Create a new dataset where all missing values are replaced with the median of the column.

Code
# Function for median imputation
median_impute <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  return(x)
}
imputed_data_median <- apply(ozone2, 2, median_impute)
imputed_data_median <- data.frame(imputed_data_median)

head(imputed_data_median)
  Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1             3     1            1           4         5480              8
2             3     1            2           5         5660              6
3             3     1            3           6         5710              4
4             5     1            4           7         5700              3
5             5     1            5           1         5760              3
6             6     1            6           2         5720              4
  Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
1           20            62   57.02    5000               -15   30.56
2           65            38   57.02    2125               -14   62.24
3           28            40   57.02    2693               -25   47.66
4           37            45   57.02     590               -24   55.04
5           51            54   45.32    1450                25   57.02
6           69            35   49.64    1568                15   53.78
  Visibility_LAX
1            200
2            300
3            250
4            100
5             60
6             60
Code
print(sum(is.na(imputed_data_median)))
[1] 0

Mode imputation

Create a new dataset where all missing values are replaced with the mode of the column.

Code
# Function for mode imputation (using the most common value)
mode_impute <- function(x) {
  mode_val <- as.numeric(names(sort(table(x), decreasing = TRUE)[1]))
  x[is.na(x)] <- mode_val
  return(x)
}
imputed_data_mode <- apply(ozone2, 2, mode_impute)
imputed_data_mode <- data.frame(imputed_data_mode)

head(imputed_data_mode)
  Ozone_reading Month Day_of_month Day_of_week Pressure_afb Wind_speed_LAX
1             3     1            1           4         5480              8
2             3     1            2           5         5660              6
3             3     1            3           6         5710              4
4             5     1            4           7         5700              3
5             5     1            5           1         5760              3
6             6     1            6           2         5720              4
  Humidity_LAX Temp_sandburg Temp_EM IBH_LAX Pressure_gradient IBT_LAX
1           20            51   54.32    5000               -15   30.56
2           19            38   54.32    5000               -14   68.72
3           28            40   54.32    2693               -25   47.66
4           37            45   54.32     590               -24   55.04
5           51            54   45.32    1450                25   57.02
6           69            35   49.64    1568                15   53.78
  Visibility_LAX
1            200
2            300
3            250
4            100
5             60
6             60
Code
print(sum(is.na(imputed_data_mode)))
[1] 0

Missing Data Imputation Using MICE (Multiple Imputation Method) and missForest (Machine Learning Method)

Next, the missing data were imputed using MICE and Miss_Forest methods.

missForest

Code
library(missForest)

# verbose = If 'TRUE' the user is supplied with additional output between iterations
# xtrue = Complete data matrix
ozone2_mf <- missForest(ozone2, xtrue = ozone2, verbose = TRUE)
# convert back to data frame
ozone2_mf <- as.data.frame(ozone2_mf$ximp)
print(sum(is.na(ozone2_mf)))

## The final results can be accessed directly. The estimated error:
ozone2_mf$OOBerror

## The true imputation error (if available):
ozone2_mf$error

MICE

Code
library(mice)

# Impute missing values using MICE
# pmm = Predictive Mean Matching (suitable for continuous variables like temperature, wind, etc.)
# m = 5: number of imputed datasets to create.
# maxit = 50: max number of iterations
ozone2_mice <- mice(ozone2, method = "pmm", m = 5, maxit = 50)

# extracts the completed datasets from the mice object
ozone2_mice <- complete(ozone2_mice)

# Convert completed data to data frame
ozone2_mice <- as.data.frame(ozone2_mice)

print(sum(is.na(ozone2_mice)))

Model-Fitting of Imputed Datasets

Make Predictions Using Data Where Missing Values Were Deleted Using Listwise Deletion (Deletion of all missing rows with missing data)

The listwise deleted dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

Load additional libraries and split the data into training and testing sets

Code
#Load additional libraries for model fitting
library(caret) # for fitting KNN models
library(e1071)
library(rsample) # for creating validation splits
library(recipes)    # for feature engineering
library(randomForest)
library(rpart)# decision tree
library(tidymodels) 
library(class) 
library(vip) 

# Split the data into training and testing sets
set.seed(123)
data_split_dropNA <- initial_split(dropNA_data, prop = 0.75)
train_dropNA <- training(data_split_dropNA)
test_dropNA <- testing(data_split_dropNA)

# Split data into predictors and target
X <- train_dropNA[, -1]  # Features
y <- train_dropNA$Ozone_reading  # Target

# Function to calculate RMSE
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}

Random Forest

Code
# Random forest model
rf_dropNA <- randomForest(Ozone_reading ~ ., data = train_dropNA)
# make predictions
rf_dropNA_pred <- predict(rf_dropNA, newdata = test_dropNA)
# Plot variable importance
varImpPlot(rf_dropNA, main = "Variable Importance Plot for Random Forest Model")

Code
rf_dropNA_rmse <- rmse(rf_dropNA_pred, test_dropNA$Ozone_reading)
cat("Random Forest RMSE:", rf_dropNA_rmse, "\n")
Random Forest RMSE: 4.591508 

KNN

Code
# KNN model
knn_dropNA <- train(Ozone_reading ~ ., data = train_dropNA, method = "knn")
# make predictions
knn_dropNA_pred <- predict(knn_dropNA, newdata = test_dropNA)
# Plot variable importance
feature_importance <- table(y) / length(y)
barplot(feature_importance, main = "Feature Importance for KNN", xlab = "Feature", ylab = "Importance")

Code
knn_dropNA_rmse <- rmse(knn_dropNA_pred, test_dropNA$Ozone_reading)
cat("KNN RMSE:", knn_dropNA_rmse, "\n")
KNN RMSE: 6.161134 

Decision Tree

Code
# Decision tree model
tree_dropNA <- rpart(Ozone_reading ~ ., data = train_dropNA)
# make predictions
tree_dropNA_pred <- predict(tree_dropNA, newdata = test_dropNA)
# Plot variable importance
var_importance <- varImp(tree_dropNA)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
tree_dropNA_rmse <- rmse(tree_dropNA_pred, test_dropNA$Ozone_reading)
cat("Decision Tree RMSE:", tree_dropNA_rmse, "\n")
Decision Tree RMSE: 5.799608 

Make Predictions Using Data Where Columns with >20% Missing Data Were Deleted (Temp_EM column)

The deleted column dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the column deletion method (feature selection), random forest had the best model fit, with an RMSE of 3.66.

Code
# Split the data into training and testing sets
set.seed(123)
data_split_dropCol <- initial_split(dropCol_data, prop = 0.75)
train_dropCol <- training(data_split_dropCol)
test_dropCol <- testing(data_split_dropCol)

# Split data into predictors and target
X <- train_dropCol[, -1]  # Features
y <- train_dropCol$Ozone_reading  # Target

Random Forest

Code
# Random forest model
rf_dropCol <- randomForest(Ozone_reading ~ ., data = train_dropCol)
# make predictions
rf_dropCol_pred <- predict(rf_dropCol, newdata = test_dropCol)
# Plot variable importance
varImpPlot(rf_dropCol, main = "Variable Importance Plot for Random Forest Model")

Code
rf_dropCol_rmse <- rmse(rf_dropCol_pred, test_dropCol$Ozone_reading)
cat("Random Forest RMSE:", rf_dropCol_rmse, "\n")
Random Forest RMSE: 3.659657 

KNN

Code
# KNN model
knn_dropCol <- train(Ozone_reading ~ ., data = train_dropCol, method = "knn")
# make predictions
knn_dropCol_pred <- predict(knn_dropCol, newdata = test_dropCol)
# Plot variable importance
feature_importance <- table(y) / length(y)
barplot(feature_importance, main = "Feature Importance for KNN", xlab = "Feature", ylab = "Importance")

Code
knn_dropCol_rmse <- rmse(knn_dropCol_pred, test_dropCol$Ozone_reading)
cat("KNN RMSE:", knn_dropCol_rmse, "\n")
KNN RMSE: 5.548309 

Decision Tree

Code
# Decision tree model
tree_dropCol <- rpart(Ozone_reading ~ ., data = train_dropCol)
# make predictions
tree_dropCol_pred <- predict(tree_dropCol, newdata = test_dropCol)
# Plot variable importance
var_importance <- varImp(tree_dropCol)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
tree_dropCol_rmse <- rmse(tree_dropCol_pred, test_dropCol$Ozone_reading)
cat("Decision Tree RMSE:", tree_dropCol_rmse, "\n")
Decision Tree RMSE: 4.266184 

Make Predictions Using Data Where Missing Values were Imputed using the Mean

The mean imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the mean imputation method, random forest had the best model fit, with an RMSE of 4.90.

Code
# Split the data into training and testing sets
set.seed(123)
data_split_mean <- initial_split(imputed_data_mean, prop = 0.75)
train_mean <- training(data_split_mean)
test_mean <- testing(data_split_mean)

# Split data into predictors and target
X <- train_mean[, -1]  # Features
y <- train_mean$Ozone_reading  # Target

# Random forest model
rf_mean <- randomForest(Ozone_reading ~ ., data = train_mean)
# make predictions
rf_mean_pred <- predict(rf_mean, newdata = test_mean)
# Plot variable importance
varImpPlot(rf_mean, main = "Variable Importance Plot for Random Forest Model")

Code
rf_mean_rmse <- rmse(rf_mean_pred, test_mean$Ozone_reading)
cat("Random Forest RMSE:", rf_mean_rmse, "\n")
Random Forest RMSE: 4.901004 

KNN

Code
# KNN model
knn_mean <- train(Ozone_reading ~ ., data = train_mean, method = "knn")
# make predictions
knn_mean_pred <- predict(knn_mean, newdata = test_mean)
# Plot variable importance
feature_importance <- table(y) / length(y)
barplot(feature_importance, main = "Feature Importance for KNN", xlab = "Feature", ylab = "Importance")

Code
knn_mean_rmse <- rmse(knn_mean_pred, test_mean$Ozone_reading)
cat("KNN RMSE:", knn_mean_rmse, "\n")
KNN RMSE: 6.117349 

Decision Tree

Code
# Decision tree model
tree_mean <- rpart(Ozone_reading ~ ., data = train_mean)
# make predictions
tree_mean_pred <- predict(tree_mean, newdata = test_mean)
# Plot variable importance
var_importance <- varImp(tree_mean)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
tree_mean_rmse <- rmse(tree_mean_pred, test_mean$Ozone_reading)
cat("Decision Tree RMSE:", tree_mean_rmse, "\n")
Decision Tree RMSE: 5.213247 

Make Predictions Using Data Where Missing Values were Imputed using the Median

The mean imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the mean imputation method, random forest had the best model fit, with an RMSE of 5.07.

Code
# Split the data into training and testing sets
set.seed(123)
data_split_med <- initial_split(imputed_data_median, prop = 0.75)
train_med <- training(data_split_med)
test_med <- testing(data_split_med)

Random Forest

Code
# Random forest model
rf_med <- randomForest(Ozone_reading ~ ., data = train_med)
rf_med_pred <- predict(rf_med, newdata = test_med)
# Plot variable importance
varImpPlot(rf_med, main = "Variable Importance Plot for Random Forest Model")

Code
rf_med_rmse <- rmse(rf_med_pred, test_med$Ozone_reading)
cat("Random Forest RMSE:", rf_med_rmse, "\n")
Random Forest RMSE: 5.073677 

KNN

Code
# KNN model
knn_med <- train(Ozone_reading ~ ., data = train_med, method = "knn")
knn_med_pred <- predict(knn_med, newdata = test_med)
# Plot variable importance
feature_importance <- table(y) / length(y)
barplot(feature_importance, main = "Feature Importance for KNN", xlab = "Feature", ylab = "Importance")

Code
knn_med_rmse <- rmse(knn_med_pred, test_med$Ozone_reading)
cat("KNN RMSE:", knn_med_rmse, "\n")
KNN RMSE: 6.298989 

Decision Tree

Code
# Decision tree model
tree_med <- rpart(Ozone_reading ~ ., data = train_med)
tree_med_pred <- predict(tree_med, newdata = test_med)
# Plot variable importance
var_importance <- varImp(tree_med)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
tree_med_rmse <- rmse(tree_med_pred, test_med$Ozone_reading)
cat("Decision Tree RMSE:", tree_med_rmse, "\n")
Decision Tree RMSE: 5.249752 

Make Predictions Using Data Where Missing Values were Imputed using the Mode

The mean imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the mode imputation method, random forest had the best model fit, with an RMSE of 5.40.

Code
# Split the data into training and testing sets
set.seed(123)
data_split_mode <- initial_split(imputed_data_mode, prop = 0.75)
train_mode <- training(data_split_mode)
test_mode <- testing(data_split_mode)

Random Forest

Code
# Random forest model
rf_mode <- randomForest(Ozone_reading ~ ., data = train_mode)
rf_mode_pred <- predict(rf_mode, newdata = test_mode)
# Plot variable importance
varImpPlot(rf_mode, main = "Variable Importance Plot for Random Forest Model")

Code
rf_mode_rmse <- rmse(rf_mode_pred, test_mode$Ozone_reading)
cat("Random Forest RMSE:", rf_mode_rmse, "\n")
Random Forest RMSE: 5.398429 

KNN

Code
# KNN model
knn_mode <- train(Ozone_reading ~ ., data = train_mode, method = "knn")
knn_mode_pred <- predict(knn_mode, newdata = test_mode)
# Plot variable importance
feature_importance <- table(y) / length(y)
barplot(feature_importance, main = "Feature Importance for KNN", xlab = "Feature", ylab = "Importance")

Code
knn_mode_rmse <- rmse(knn_mode_pred, test_mode$Ozone_reading)
cat("KNN RMSE:", knn_mode_rmse, "\n")
KNN RMSE: 6.371706 

Decision Tree

Code
# Decision tree model
tree_mode <- rpart(Ozone_reading ~ ., data = train_mode)
tree_mode_pred <- predict(tree_mode, newdata = test_mode)
# Plot variable importance
var_importance <- varImp(tree_mode)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
tree_mode_rmse <- rmse(tree_mode_pred, test_mode$Ozone_reading)
cat("Decision Tree RMSE:", tree_mode_rmse, "\n")
Decision Tree RMSE: 6.603588 

Make Predictions using data where missing values were imputed with complex methods:

Make Predictions Using Data Where Missing Values were Imputed using missForest Method

The missForest imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the missForest imputation method, random forest had the best model fit, with an RMSE of 4.46.

Code
# Split the data into training and testing sets
set.seed(123)
data_split_miss <- initial_split(ozone2_mf, prop = 0.75)
train_miss <- training(data_split_miss)
test_miss <- testing(data_split_miss)

Random Forest

Code
# Random forest model
rf_miss <- randomForest(Ozone_reading ~ ., data = train_miss)
rf_miss_pred <- predict(rf_miss, newdata = test_miss)
# Plot variable importance
varImpPlot(rf_miss, main = "Variable Importance Plot for Random Forest Model")

Code
rf_miss_rmse <- rmse(rf_miss_pred, test_miss$Ozone_reading)
cat("Random Forest RMSE:", rf_miss_rmse, "\n")
Random Forest RMSE: 4.493497 

KNN

Code
# KNN model
knn_miss <- train(Ozone_reading ~ ., data = train_miss, method = "knn")
knn_miss_pred <- predict(knn_miss, newdata = test_miss)
# Plot variable importance
feature_importance <- table(y) / length(y)
barplot(feature_importance, main = "Feature Importance for KNN", xlab = "Feature", ylab = "Importance")

Code
knn_miss_rmse <- rmse(knn_miss_pred, test_miss$Ozone_reading)
cat("KNN RMSE:", knn_miss_rmse, "\n")
KNN RMSE: 6.118703 

Decision Tree

Code
# Decision tree model
tree_miss <- rpart(Ozone_reading ~ ., data = train_miss)
tree_miss_pred <- predict(tree_miss, newdata = test_miss)
# Plot variable importance
var_importance <- varImp(tree_miss)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
tree_miss_rmse <- rmse(tree_miss_pred, test_miss$Ozone_reading)
cat("Decision Tree RMSE:", tree_miss_rmse, "\n")
Decision Tree RMSE: 5.537124 

Make Predictions Using Data Where Missing Values were Imputed using MICE Method

The MICE imputed dataset was split into testing and training datasets using a split ratio of 0.75. Models were then fit using the following algorithms:

  • Random Forest
  • K-Nearest Neighbors (KNN)
  • Decision Tree

After models were fit to the training data, predictions were made on the unseen test data. RMSE was then reported for each model.

For the MICE imputation method, random forest had the best model fit, with an RMSE of 4.70.

Code
# Split the data into training and testing sets
set.seed(123)
data_split_mice <- initial_split(ozone2_mice, prop = 0.75)
train_mice <- training(data_split_mice)
test_mice <- testing(data_split_mice)

Random Forest

Code
# Random forest model
rf_mice <- randomForest(Ozone_reading ~ ., data = train_mice)
rf_mice_pred <- predict(rf_mice, newdata = test_mice)
# Plot variable importance
varImpPlot(rf_mice, main = "Variable Importance Plot for Random Forest Model")

Code
rf_mice_rmse <- rmse(rf_mice_pred, test_mice$Ozone_reading)
cat("Random Forest RMSE:", rf_mice_rmse, "\n")
Random Forest RMSE: 4.367134 

KNN

Code
# KNN model
knn_mice <- train(Ozone_reading ~ ., data = train_mice, method = "knn")
knn_mice_pred <- predict(knn_mice, newdata = test_mice)
# Plot variable importance
feature_importance <- table(y) / length(y)
barplot(feature_importance, main = "Feature Importance for KNN", xlab = "Feature", ylab = "Importance")

Code
knn_mice_rmse <- rmse(knn_mice_pred, test_mice$Ozone_reading)
cat("KNN RMSE:", knn_mice_rmse, "\n")
KNN RMSE: 6.142781 

Decision Tree

Code
# Decision tree model
tree_mice <- rpart(Ozone_reading ~ ., data = train_mice)
tree_mice_pred <- predict(tree_mice, newdata = test_mice)
# Plot variable importance
var_importance <- varImp(tree_mice)
barplot(var_importance$Overall, main = "Variable Importance for Decision Tree", xlab = "Variable", ylab = "Importance")

Code
tree_mice_rmse <- rmse(tree_mice_pred, test_mice$Ozone_reading)
cat("Decision Tree RMSE:", tree_mice_rmse, "\n")
Decision Tree RMSE: 5.077593 

Summarize Model Performance for Each Imputation Methodology

The RMSE for each model and imputation method combination was summarized into a data frame. The results are shown below. The best model fit (lowest RMSE) was obtained when using feature selection (column deletion) for imputation of missing data and the random forest algorithm for model training with the resulting dataset. Mean imputation with the Decision Tree model fitting was the second best combination, followed by mode imputation with Random Forest.

Code
models <- c('RandomForest', 'KNN', 'DecisionTree')
scores <- c(rf_dropCol_rmse, knn_dropCol_rmse, tree_dropCol_rmse,
            rf_dropNA_rmse, knn_dropNA_rmse, tree_dropNA_rmse,
            rf_mean_rmse, knn_mean_rmse, tree_mean_rmse,
            rf_med_rmse, knn_med_rmse, tree_med_rmse,
            rf_mode_rmse, knn_mode_rmse, tree_mode_rmse,
            rf_miss_rmse, knn_miss_rmse, tree_miss_rmse,
            rf_mice_rmse, knn_mice_rmse, tree_mice_rmse
)
ImpMethod <- c('DropCol','DropNA','Mean', 'Median', 'Mode','missForest','MICE')

# Create dataframe
rmse_df <- data.frame(Model = models, ImpMethod=ImpMethod,RMSE = scores)
print(rmse_df[order(rmse_df$RMSE), ])
          Model  ImpMethod     RMSE
1  RandomForest    DropCol 3.659657
3  DecisionTree       Mean 4.266184
19 RandomForest       Mode 4.367134
16 RandomForest     DropNA 4.493497
4  RandomForest     Median 4.591508
7  RandomForest       MICE 4.901004
10 RandomForest       Mean 5.073677
21 DecisionTree       MICE 5.077593
9  DecisionTree     DropNA 5.213247
12 DecisionTree       Mode 5.249752
13 RandomForest missForest 5.398429
18 DecisionTree     Median 5.537124
2           KNN     DropNA 5.548309
6  DecisionTree missForest 5.799608
8           KNN    DropCol 6.117349
17          KNN       Mean 6.118703
20          KNN missForest 6.142781
5           KNN       Mode 6.161134
11          KNN     Median 6.298989
14          KNN       MICE 6.371706
15 DecisionTree    DropCol 6.603588

Conclusions

In summary, maintaining the integrity and dependability of data analysis procedures depends on the management of missing data. This work demonstrates various methodologies that can be used to handle missing data, including listwise and feature selection deletion methods, machine learning-based algorithms like missForest, single imputation techniques such as mean, mode, median imputation, and multiple imputation using MICE. The dataset “Ozone”, from the mlbench package in R was used to demonstrate the techniques for either deleting or imputing missing data. Statistical tests and visualization techniques to help classify data as MCAR, MAR, and NMAR were also demonstrated. With respect to the “Ozone” dataset, deletion of the column Temp_EM, which contains ~38% missing data, was found to be the most effective method for handling missing data. This was demonstrated through machine learning models trained to datasets where missing data were handled using the various techniques described above. A total of seven datasets were created, and each dataset was trained and tested with KNN, decision tree, and random forest algorithms. The random forest model in combination with feature selection for handling missing data resulted in the lowest RMSE with respect to the test (unseen) dataset. This may have been the best result in this instance because this data appear to be MCAR based on statistical testing. It is best practice to test a variety of different methods, as was performed in this work, and predefine success criteria when handling missing data. To provide reliable and accurate results from data analysis, the imputation method selected should ultimately be in line with the unique features of the dataset and the analytical objectives.

References

Awan, Bennamoun, S. E. 2022. “A Reinforcement Learning-Based Approach for Imputing Missing Data.” Neural Computing and Applications 34 (12): 9701–16.
Azur, Melissa J, Elizabeth A Stuart, Constantine Frangakis, and Philip J Leaf. 2011. “Multiple Imputation by Chained Equations: What Is It and How Does It Work?” International Journal of Methods in Psychiatric Research 20 (1): 40–49.
Batista, Gustavo EAPA, and Maria Carolina Monard. 2003. “An Analysis of Four Missing Data Treatment Methods for Supervised Learning.” Applied Artificial Intelligence 17 (5-6): 519–33.
Bennett, Derrick A. 2001. “How Can i Deal with Missing Data in My Study?” Australian and New Zealand Journal of Public Health 25 (5): 464–69.
Buuren, S. van. 2018. Flexible Imputation of Missing Data. CRC Press. https://books.google.com/books/about/Flexible_Imputation_of_Missing_Data_Seco.html?id=lzb3DwAAQBAJ&source=kp_book_description.
Du, Jinghan, Minghua Hu, and Weining Zhang. 2020. “Missing Data Problem in the Monitoring System: A Review.” IEEE Sensors Journal 20 (23): 13984–98.
Emmanuel, Maupong, T. 2021. “A Survey on Missing Data in Machine Learning.” J Big Data 8: 401–7.
Ghazali, Shamihah Muhammad, Norshahida Shaadan, and Zainura Idrus. 2020. “Missing Data Exploration in Air Quality Data Set Using r-Package Data Visualisation Tools.” Bulletin of Electrical Engineering and Informatics 9 (2): 755–63.
Guzel, Kaya, C. 2013. “Breast Cancer Diagnosis Based on Naïve Bayes Machine Learning Classifier with KNN Missing Data Imputation.” AWERProcedia Information Technology & Computer Science 4: 332–46.
Ibrahim, Chen, J. G. 2011. “Missing-Data Methods for Generalized Linear Models.” Journal of the American Statistical Association 100 (469).
Jerez, Molina, J. M. 2010. “Missing Data Imputation Using Statistical and Machine Learning Methods in a Real Breast Cancer Problem.” Artificial Intelligence in Medicine 50 (2): 105–15.
Li, Cheng. 2013. “Little’s Test of Missing Completely at Random.” The Stata Journal 13 (4): 795–809.
Lin, Wei-Chao, and Chih-Fong Tsai. 2020. “Missing Value Imputation: A Review and Analysis of the Literature (2006–2017).” Artificial Intelligence Review 53: 1487–1509.
Little, Todd D, Terrence D Jorgensen, Kyle M Lang, and E Whitney G Moore. 2014. “On the Joys of Missing Data.” Journal of Pediatric Psychology 39 (2): 151–62.
Mack, Christina, Zhaohui Su, and Daniel Westreich. 2018. “Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide.”
May, Michael, Christine Körner, Dirk Hecker, Martial Pasquier, Urs Hofmann, and Felix Mende. 2009. “Modelling Missing Values for Audience Measurement in Outdoor Advertising Using GPS Data.” In GI Jahrestagung, 3993–4006.
Ren, Lijuan, Tao Wang, Aicha Sekhari Seklouli, Haiqing Zhang, and Abdelaziz Bouras. 2023. “A Review on Missing Values for Main Challenges and Methods.” Information Systems, 102268.
Rubin, Donald B. 2004. Multiple Imputation for Nonresponse in Surveys. Vol. 81. John Wiley & Sons.
Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. CRC press.
Schefer, J. 2002. “Dealing with Missing Data.” Research Letters in the Information and Mathematical Sciences 3 (1): 153–60.
Tang, Fei. 2017. Random Forest Missing Data Approaches. University of Miami.
Zhang, Sun, X. 2024. “A Matrix Completion Method for Imputing Missing Values of Process Data.” Processes 12 (4): 659.
Zheng, Siming, Juan Zhang, and Yong Zhou. 2023. “Likelihood Identifiability and Parameter Estimation with Nonignorable Missing Data.” Canadian Journal of Statistics 51 (4): 1190–1209.